Abstract

Respondent-driven sampling (RDS) is widely used to study populations at high risk of HIV/AIDS. However, recent
studies have shown that traditional RDS methods are likely to generate large variances and may be severely biased,
since the assumptions behind RDS are seldom fully met in real life. To improve estimation in RDS studies, we propose
a new method that generates estimates from ego network data, collected by asking RDS respondents about the
composition of their personal networks, for example "what proportion of your friends are married?". Using
simulations on an extracted real-world social network of gay men as well as on artificial networks with varying
structural properties, we show that the new estimator, RDSI^ego, outperforms traditional RDS estimators.
Importantly, RDSI^ego is strongly robust to preferential peer recruitment and to variations in network structural
properties such as homophily, activity ratio, and community structure. While the biases of traditional RDS
estimators can sometimes be as large as 10%-20%, the biases of all RDSI^ego estimates stay below 2%. These positive
results encourage researchers to collect ego network data for variables of interest in RDS studies, both for
hard-to-access populations and for general populations when random sampling is not applicable. The limitations of
RDSI^ego are evaluated by simulating RDS under different levels of reporting error.

Keywords: networks, ego networks, respondent-driven sampling, differential recruitment, reporting error 



1. Introduction 

In many forms of research, there is no list of all members of the studied population (i.e., a sampling frame) from
which a random sample may be drawn and from which population characteristics may be inferred based on the selection
probabilities of the sample units. Non-probability sampling methods may be used in such situations, such as key
informant sampling [1], targeted/location sampling [2], and snowball sampling [3]. However, these methods all
introduce considerable selection bias, which impairs generalization of the findings from the sample to the studied
population [4, 5]. Respondent-driven sampling (RDS) is an alternative method that is currently used extensively in
public health research for the study of hard-to-access populations, e.g., injecting drug users (IDUs), men who have
sex with men (MSM), and sex workers (SWs). With a link-tracing network sampling design, the RDS method provides
unbiased population estimates as well as a feasible implementation, making it the state-of-the-art sampling method
for studying hard-to-access populations [6, 7, 8, 9, 10].

RDS starts with a number of pre-selected respondents who serve as "seeds". After an interview, the seeds are
asked to distribute a certain number of coupons (usually 3) to friends who are also within the studied population.
Individuals with a valid coupon can then participate in the study and are given the same number of coupons to
distribute. This recruitment process is repeated until the desired sample size is reached [4]. In a typical RDS
study, information about who recruits whom and the respondents' number of friends within the population (degree) is
also recorded for the purpose of generating population estimates from the sample [11, 12].

Suppose an RDS study is conducted on a connected network with the additional assumptions that (i) network links
are undirected, (ii) peer recruitment is done with replacement, (iii) each participant recruits one peer from
his/her neighbors, and (iv) peer recruitment is a random selection among all of the participant's neighbors. Then
the RDS process can be modeled as a Markov process, and the composition of the sample will stabilize and become
independent of the properties of the seeds [12, 13, 14]. It follows that the probability for each node to be
included in the RDS sample is proportional to its degree. Specifically, consider a sample
$U = \{v_1, v_2, \ldots, v_n\}$, with $n_A$ being the number of respondents in the sample with property A (e.g.,
HIV-positive) and $n_B = n - n_A$ the rest. Let $\{d_1, d_2, \ldots, d_n\}$ be the respondents' degrees and

$$S = \begin{bmatrix} s_{AA} & s_{AB} \\ s_{BA} & s_{BB} \end{bmatrix}$$

be the recruitment matrix observed from the sample, where $s_{XY}$ is the proportion of recruitments from group X
to group Y (for the purpose of this paper, we consider a binary property such that each individual belongs either to
group A or B). Then the proportion of individuals belonging to group A in the population, $P_A^*$, can be estimated
by [12, 14]:

$$\hat{P}_A^{RDSI} = \frac{s_{BA}\hat{D}_B}{s_{AB}\hat{D}_A + s_{BA}\hat{D}_B} \tag{1}$$

or

$$\hat{P}_A^{RDSII} = \frac{\sum_{v_i \in A \cap U} d_i^{-1}}{\sum_{v_i \in U} d_i^{-1}} \tag{2}$$

where $\hat{D}_A = n_A / \sum_{v_i \in A \cap U} d_i^{-1}$ and $\hat{D}_B = n_B / \sum_{v_i \in B \cap U} d_i^{-1}$ are
the estimated average degrees for individuals of group A and B in the population. Both estimators give
asymptotically unbiased estimates. The estimation procedure above is also called a re-weighted random walk (RWRW) in
other fields [15].
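
As an informal illustration of estimators (1) and (2), the following Python sketch computes both from a toy sample. The function name `rds_estimates`, the list-based input layout, and the toy data are assumptions made for this example only; they are not part of any RDS software.

```python
from collections import Counter

def rds_estimates(groups, degrees, recruitments):
    """Return (RDS I, RDS II) estimates of the proportion of group A.

    groups       -- list of "A"/"B" labels, one per respondent (hypothetical layout)
    degrees      -- list of reported degrees d_i, all positive
    recruitments -- list of (recruiter_index, recruit_index) pairs
    """
    n = Counter(groups)
    # Inverse-degree sums per group: sum of 1/d_i over respondents in X.
    inv = {"A": 0.0, "B": 0.0}
    for g, d in zip(groups, degrees):
        inv[g] += 1.0 / d

    # Estimated average degrees: D_X = n_X / sum(1/d_i).
    d_hat = {g: n[g] / inv[g] for g in ("A", "B")}

    # Recruitment proportions: s_XY = share of X's recruitments that go to Y.
    counts = Counter((groups[i], groups[j]) for i, j in recruitments)
    s_ab = counts["A", "B"] / (counts["A", "A"] + counts["A", "B"])
    s_ba = counts["B", "A"] / (counts["B", "A"] + counts["B", "B"])

    rds1 = s_ba * d_hat["B"] / (s_ab * d_hat["A"] + s_ba * d_hat["B"])  # Eq. (1)
    rds2 = inv["A"] / (inv["A"] + inv["B"])                             # Eq. (2)
    return rds1, rds2

# Toy chain A -> B -> A -> B with degrees 2, 4, 2, 4.
print(rds_estimates(["A", "B", "A", "B"], [2, 4, 2, 4], [(0, 1), (1, 2), (2, 3)]))
```

In this toy chain the low-degree group A is under-represented relative to its inverse-degree weight, so both estimators place the proportion of A above the raw sample share of one half.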

The methodology of RDS is well designed; however, the assumptions underlying the RDS estimators are rarely met in
practice [7, 16, 17]. For example, empirical RDS studies use more than one coupon, and sampling is conducted without
replacement, that is, each respondent is only allowed to participate once. A comprehensive evaluation was made by
Lu et al. [18], where the effects of violating assumptions (i)-(iv), as well as the effects of the selection and
number of seeds and coupons, were evaluated one by one by simulating RDS processes on an empirical MSM network and
on artificial networks with known population properties. They showed that when the sample size is relatively small
(< 10% of the population), RDS estimators are strongly resistant to violations of certain assumptions, such as low
response rates and errors in self-reported degrees. On the other hand, large bias and variance may result from
differential recruitment or from networks with non-reciprocal relationships. For relatively large sample sizes
(> 50% of the population), similar results were found by Gile and Handcock [19], who focused on the sensitivity of
RDS estimators to the selection of seeds, respondent behavior, and violation of assumption (ii).

It was not until recently that researchers found that the variance in RDS may have been severely underestimated [20].
In a study based on simulated RDS samples on empirical networks, Goel and Salganik [17] found that the RDS estimator
typically generates five to ten times greater variance than simple random sampling [20]. Moreover, McCreesh et
al. [21] conducted an RDS study of male household heads in rural Uganda, where the true population data were known,
and found that only one-third of RDS estimates outperformed the raw proportions in the RDS sample, and that only
50%-74% of RDS 95% confidence intervals, calculated with a bootstrap approach for RDS, included the true population
proportion.

For the above reasons, there has been an increasing interest in developing new RDS estimators to improve the 
performance of RDS. For example, Gile [22] developed a successive-sampling-based estimator for RDS to adjust the 
assumption of sampling with replacement and demonstrated its superior performance when the size of the population 
is known. Lu et al [23] proposed a series of new estimators for RDS on directed networks, with known indegree 
difference between estimated groups. Both of the above estimators can be used as a sensitivity test when the required 
population parameters are not known. 

The traditional RDS I and RDS II estimators, as well as the estimators newly developed by Gile et al. [22, 24] and
Lu et al. [23], all utilize the same information collected by standard RDS practice, that is, the recruitment
matrix S, respondents' degrees, and sample properties. There is, however, scope to improve estimates dramatically if
data on the composition of respondents' ego networks can be put to use. Such data have already been collected for
other purposes in many RDS studies. For example, in an RDS study of MSM in Campinas City, Brazil, by de Mello et
al. [25], respondents were asked to describe the percentage of their friends/acquaintances with certain
characteristics, such as disclosure of sexual orientation to family and HIV status. In an RDS study of opiate users
in Yunnan, China, various information about supporting, drug using, and sexual behaviors between respondents and
their network members was collected [26]. One of the most thorough RDS studies utilizing ego network information
was done by Rudolph et al. [27], who asked respondents to provide extensive characteristics for each alter within
their personal networks, such as demographic characteristics, history of incarceration, drug injection, and crack
and heroin use.

Aiming to improve the RDS estimator, we focus on how to integrate this additional information in the estimation
process to generate improved population estimates. The rest of this paper is organized as follows. In Section 2, we
develop a new estimator that integrates traditional RDS data with egocentric data; in Section 3, we describe the
network data used for simulation and the study design; in Section 4, we evaluate the performance of the new
estimator by simulated RDS processes under various settings; and in Section 5, we summarize and draw our conclusions.



2. RDSI^ego: estimator for RDS with egocentric data

The ego networks from an RDS sample differ from general egocentric data collected in many sociological surveys [28]
in that each "ego" is connected with (recruited by) its recruiter. For example, in the partial RDS chain illustrated
in Figure 1, participants $v_i$, $v_j$, $v_k$ are asked to provide personal network compositions, and $v_j$ and
$v_k$ are recruited by $v_i$ and $v_j$, respectively.

For each respondent $v_i$ in an RDS sample $U = \{v_1, v_2, \ldots, v_n\}$, let $n_i^A$, $n_i^B$ be the number of
$v_i$'s friends with property A, B, respectively. We now show how to integrate the ego network information for
estimating the proportion of individuals with property A in the population, $P_A^*$. Assuming that the RDS process
is conducted on a connected, undirected network with assumptions (i)-(iii) fulfilled, the probability that each node
will be included in the sample, $\Pr(v_i)$, will be proportional to its degree [12, 13, 14]:

$$\Pr(v_i) = \frac{d_i}{\sum_{j=1}^{N} d_j}, \tag{3}$$

where N is the size of the population of interest.

Consequently, the probability that each link $e_{i \to j}$ will be selected to recruit a friend,
$\Pr(e_{i \to j})$, depends on $\Pr(v_i)$. Under the random recruitment assumption, we have:

$$\Pr(e_{i \to j}) = \Pr(v_i) \cdot \frac{1}{d_i} = \frac{1}{\sum_{k=1}^{N} d_k}, \tag{4}$$

that is, each link has the same probability of being selected via the RDS process. Consequently, the observed
recruitment matrix S is a random sample of the cross-group links of the network [12].
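
The degree-proportional inclusion probability can be checked numerically on a small graph. The sketch below iterates a simple random walk (one recruit per step, chosen uniformly among neighbors, with replacement) and compares the resulting distribution with $d_i / \sum_k d_k$. The function name `walk_distribution` and the four-node toy graph are illustrative assumptions, not data from this study.

```python
def walk_distribution(adj, steps=200):
    """Distribution of a simple random walk after `steps` moves from a uniform start."""
    nodes = list(adj)
    prob = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(steps):
        nxt = dict.fromkeys(nodes, 0.0)
        for v in nodes:
            share = prob[v] / len(adj[v])  # each neighbor is equally likely
            for u in adj[v]:
                nxt[u] += share
        prob = nxt
    return prob

# Toy undirected, connected graph: a triangle (0, 1, 2) with a pendant node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
total_degree = sum(len(nbrs) for nbrs in adj.values())
limit = walk_distribution(adj)
for v in adj:
    # Second and third columns should agree: walk probability vs. d_i / sum of degrees.
    print(v, round(limit[v], 4), len(adj[v]) / total_degree)
```

The graph contains a triangle, so the walk is not periodic and the iteration settles on the degree-proportional distribution regardless of the starting point.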

The above are general inferences from a typical RDS process. We now turn our attention to the egocentric data
source. Let $\Pr(e^{ego}_{i \to j})$ be the probability that link $e_{i \to j}$ will be reported by "ego" $v_i$.
Since $e_{i \to j}$ is reported as long as $v_i$ is included in the sample, we have:

$$\Pr(e^{ego}_{i \to j}) = \Pr(v_i) = \frac{d_i}{\sum_{k=1}^{N} d_k}. \tag{5}$$

Consequently, to estimate the proportion of type $e_{X \to Y}$ ($X, Y \in \{A, B\}$) links in the population,
$s^*_{XY}$, we can weight the observed number of type $e_{X \to Y}$ links by their inclusion probability to
construct a generalized Hansen-Hurwitz estimator [29]:

$$\hat{s}^{ego}_{XY} = \frac{\hat{N}^{ego}_{XY}}{\hat{N}^{ego}_{XA} + \hat{N}^{ego}_{XB}}
= \frac{\sum_{v_i \in X \cap U} n_i^Y / d_i}{\sum_{v_i \in X \cap U} n_i^A / d_i + \sum_{v_i \in X \cap U} n_i^B / d_i}, \tag{6}$$

where $\hat{N}^{ego}_{XY} = \sum_{v_i \in X \cap U} n_i^Y / d_i$ is the weighted number of type $e_{X \to Y}$ links
reported in the sample's ego networks. Since the denominator in (6) can be rewritten as:

$$\sum_{v_i \in X \cap U} \frac{n_i^A}{d_i} + \sum_{v_i \in X \cap U} \frac{n_i^B}{d_i}
= \sum_{v_i \in X \cap U} \frac{n_i^A + n_i^B}{d_i} = \sum_{v_i \in X \cap U} \frac{d_i}{d_i} = n_X, \tag{7}$$

we have:

$$\hat{s}^{ego}_{XY} = \frac{1}{n_X} \sum_{v_i \in X \cap U} \frac{n_i^Y}{d_i}. \tag{8}$$

Figure 1: An RDS chain with egocentric data. (a) RDS on a network. Red nodes are those that participated in the RDS
survey, and yellow nodes are the ego network members reported by participants; (b) A partial RDS chain with color
representing properties of nodes.

Note that in (8), the recruitment links are also counted as reported ego-alter links. Taking Figure 1 as an example,
$e_{i \to j}$ and $e_{j \to i}$ will be counted as a blue → green ego-alter link and a green → blue ego-alter link,
respectively.

Using $\hat{s}^{ego}_{XY}$ from (8) as an alternative to S, which is used in the RDS I estimator, we can estimate
$P_A^*$ by the same equation as (1). For the sake of clarity, the derivation of (1) is reproduced as follows.

In an undirected network, the number of cross-group links from A to B should equal the number of links from B to A:

$$N_A D_A^* s_{AB}^* = N_B D_B^* s_{BA}^*, \tag{9}$$

where $N_A = N - N_B$ is the number of individuals of group A in the population, and $D_A^*$, $D_B^*$ are the
average degrees of the two groups. Substituting $N_B = N - N_A$ and solving (9) for $P_A^* = N_A / N$ gives
$P_A^* = s_{BA}^* D_B^* / (s_{AB}^* D_A^* + s_{BA}^* D_B^*)$.

If we let $\hat{s}^{ego}_{XY}$ be the estimator of $s^*_{XY}$ and let
$\hat{D}_X = n_X / \sum_{v_i \in X \cap U} d_i^{-1}$ be the estimator of $D_X^*$ ($X, Y \in \{A, B\}$), then $P_A^*$
can be estimated by:

$$\hat{P}_A^{RDSI^{ego}} = \frac{\hat{s}^{ego}_{BA} \hat{D}_B}{\hat{s}^{ego}_{AB} \hat{D}_A + \hat{s}^{ego}_{BA} \hat{D}_B}
\quad (\mathrm{RDSI}^{ego}). \tag{10}$$



In all, the RDSI^ego estimator uses the ego-network-based estimate of the recruitment matrix,
$\hat{s}^{ego}_{XY}$, instead of the observed S used in RDS I. There are at least two advantages to using
$\hat{s}^{ego}_{XY}$ rather than $s_{XY}$:

First, the sample size for inferring $\hat{s}^{ego}_{XY}$ is considerably larger than that for $s_{XY}$, reducing
random error and making the estimates more reliable;

Second, in real RDS practice, respondents can hardly recruit their friends randomly [9, 16, 25], which leads to
unknown bias and error in the representativeness of $s_{XY}$. $\hat{s}^{ego}_{XY}$, on the other hand, takes all of
an ego's links into consideration and consequently avoids this problem. Even though the inclusion probability of a
node may be shifted away from $\Pr(v_i)$ when recruitment is non-random, $\hat{s}^{ego}_{XY}$ can greatly reduce the
bias and error caused by this violation, as we will see in Section 4.

Note also that implementing RDSI^ego does not necessarily require each respondent i to list the property of each of
her/his alters: since degree is always collected in RDS, an estimated proportion of friends with a certain property
A, $r_i^A$, is enough to obtain the number of alters from group A, $r_i^A d_i$.
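
To make the procedure concrete, here is a minimal Python sketch of Eqs. (8) and (10). The function name `rdsi_ego`, the tuple-based input layout, and the toy numbers are assumptions for illustration; the sketch also assumes a binary property and that every respondent reports counts with $n_i^A + n_i^B = d_i$.

```python
def rdsi_ego(groups, degrees, ego_counts):
    """Return (s_AB, s_BA, P_A) estimated from ego network data.

    groups     -- "A"/"B" label of each respondent (hypothetical layout)
    degrees    -- reported degree d_i of each respondent, all positive
    ego_counts -- (n_i^A, n_i^B) per respondent, with n_i^A + n_i^B = d_i
    """
    n = {"A": 0, "B": 0}
    inv = {"A": 0.0, "B": 0.0}       # sum of 1/d_i per group
    to_other = {"A": 0.0, "B": 0.0}  # sum of n_i^Y / d_i with Y the other group
    for g, d, (n_a, n_b) in zip(groups, degrees, ego_counts):
        n[g] += 1
        inv[g] += 1.0 / d
        to_other[g] += (n_b if g == "A" else n_a) / d

    s_ab = to_other["A"] / n["A"]    # Eq. (8) with X = A, Y = B
    s_ba = to_other["B"] / n["B"]    # Eq. (8) with X = B, Y = A
    d_a = n["A"] / inv["A"]          # estimated average degree of group A
    d_b = n["B"] / inv["B"]          # estimated average degree of group B
    p_a = s_ba * d_b / (s_ab * d_a + s_ba * d_b)  # Eq. (10)
    return s_ab, s_ba, p_a

# Four respondents; each tuple gives the reported numbers of A and B friends.
print(rdsi_ego(["A", "A", "B", "B"], [4, 2, 4, 4], [(2, 2), (1, 1), (1, 3), (3, 1)]))
```

When respondents report a proportion $r_i^A$ instead of counts, the same function can be fed `(r * d, (1 - r) * d)` for each respondent, since only the ratios $n_i^Y / d_i$ enter Eq. (8).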



3. Simulation study design 

3.1. Network data 

In this paper we use both an anonymized empirical social network and simulated networks to evaluate the performance
of the newly proposed estimator. The empirical network, previously analyzed in [18, 23, 30], comes from the Nordic
region's largest and most active web community for homosexual, bisexual, transgender, and queer persons. Nodes of
the network are website members who identify themselves as homosexual males, and links are friendship relations,
defined as two nodes adding each other to their "favorite list", based on which they maintain their contacts and
send messages. Only nodes and links within the giant connected component are used for this study, yielding a network
of size N = 16082 and average degree D* = 6.74. Four dichotomous properties from users' profiles have been studied:
age (born before 1980, age), county (live in Stockholm, ct), civil status (married, cs), and profession (employed,
pf). The population values of group proportion ($P_A^*$), cross-group link probability ($s_{AB}^*$), homophily, and
activity ratio are listed in Table 1.



Table 1: Basic statistics for variables in the MSM network

variable   P*_A (%)   s*_AB   Homophily   Activity ratio
age        77.8       0.13    0.40        1.05
ct         38.8       0.30    0.50        1.22
cs         40.4       0.57    0.05        0.97
pf         38.2       0.54    0.13        1.21



Homophily, quantified as $h_A = 1 - s_{AB}^* / P_B^*$, is the probability that nodes connect with friends who are
similar to themselves rather than randomly. If the homophily of a property is 0, all nodes are connected to their
friends purely randomly, regardless of this property; if the homophily is 1, all nodes with a particular property
are connected only to friends with the same property. The activity ratio is the ratio of the mean degree of group A
to that of group B, $w = D_A^* / D_B^*$. Previous studies have found that homophily and activity ratio are two
critical factors that may affect the performance of RDS estimators [19]. Generally, the larger the homophily or the
difference between the groups' mean degrees, the larger the bias and variance of the estimates. The various levels
of homophily and activity ratio of the four variables in the MSM network provide a rich test base for RDS
estimators. For example, the homophily for county is 0.50, which means that members who live in Stockholm form links
with members who also live in Stockholm 50% of the time, while they form links randomly among all counties
(including Stockholm) the remaining 50% of the time. Civil status has a very low level of homophily, indicating
that edges are formed as if randomly among other members, regardless of their marital status.
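
The homophily definition can be checked against the values in Table 1 with a few lines of Python. The helper name `homophily` and the dictionary layout are assumptions for this example; the inputs are the rounded $P_A^*$ and $s_{AB}^*$ values from the table, so the results are expected to match the Homophily column only up to rounding.

```python
def homophily(s_ab, p_a):
    """h_A = 1 - s_AB / P_B, with P_B = 1 - P_A for a binary property."""
    return 1.0 - s_ab / (1.0 - p_a)

# variable: (P_A, s_AB), taken from Table 1 (rounded values).
table1 = {
    "age": (0.778, 0.13),
    "ct": (0.388, 0.30),
    "cs": (0.404, 0.57),
    "pf": (0.382, 0.54),
}
for name, (p_a, s_ab) in table1.items():
    print(name, round(homophily(s_ab, p_a), 2))
```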

To systematically evaluate the effect of homophily and activity ratio on the performance of RDS estimators, we have
also generated a set of simulated networks with $h_A \in [0, 0.5]$ and $w \in [0.5, 2.5]$ based on the KOSKK model,
which is among the best social network models for producing realistic network structure with respect to degree
distributions, assortativity, clustering spectra, geodesic path distributions, and community structure [31, 32].
These networks are configured with population size N = 10000, average degree D* = 10, and population value
$P_A^* = 30\%$ (see Appendix for details).

3.2. Study design 

Based on the MSM network and the artificial KOSKK networks, RDS processes are then simulated, and the sample
proportions and estimates are compared with the population values to evaluate the accuracy of the different
estimators. In particular, we consider the following aspects:

Sample size: we set the sample size to 500.

Sampling without replacement (SWOR): as in most empirical RDS studies, nodes cannot be recruited again once they are
in the sample.

Number of seeds and coupons: following [19], we consider two scenarios: 6 seeds with 2 coupons, yielding 500
respondents within 6 waves, and 10 seeds with 3 coupons, yielding 500 respondents within 4 waves. However, we find
no significant difference between the two settings in simulations and thus show results for 6 seeds and 2 coupons.

Random and differential recruitment: one of the assumptions least likely to be met in real life is that participants
recruit peers randomly. For example, respondents may tend to recruit people who they think will benefit most from
the RDS incentives [9]. In a study of MSM in Campinas City, Brazil [25], participants were reported most often to
recruit close peers or peers they believed practiced risky behaviors. It has been shown in [16, 18, 19] that all
current RDS estimators generate bias when the outcome variables are related to such non-random distribution of
coupons among respondents' personal networks (differential recruitment).

To test the robustness of the new estimator, we consider both scenarios. Let $p_A^{diff} \in [0, 1]$ control how
much more likely individuals from group A are to be recruited, by both group A members and group B members. Then
$p_A^{diff} = 0$ corresponds to random recruitment, where coupons are distributed randomly among respondents'
friends, and $p_A^{diff} = 1$ corresponds to the extreme scenario in which both group A members and group B members
are twice as likely to recruit peers of type A. The latter substantially oversamples individuals from group A and
inflates the proportions of recruitment links toward group A, $s_{AA}$ and $s_{BA}$.

Reporting error about degree and ego networks: the new estimator requires respondents to report ego network
information, bringing a new challenge in RDS. We simulate reporting error in two stages of an RDS process: first,
when a respondent reports his or her degree, any alter of type A or B will be missed and not reported with
probability $p_A^{miss}$ or $p_B^{miss}$, respectively; second, when the composition of an ego network is reported,
any alter of type A will be misclassified as type B with probability $p_{A \to B}^{error}$, and any alter of type B
will be misclassified as type A with probability $p_{B \to A}^{error}$.

RDS estimators: since previous studies have suggested that sample composition may sometimes be an even better
approximation of $P_A^*$ than traditional RDS estimators [17, 21], in addition to RDS I and RDSI^ego, we also
include the raw sample composition in the analysis. The RDS II estimator in our simulations provides estimates that
differ little from RDS I and is thus not presented separately.

Since we are interested in generating feasible population estimates from information collected only within the RDS
sample, the newly developed estimators that require known population parameters [22, 23, 24] are beyond the scope of
this study and are excluded from the comparison.
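
The recruitment step of such a simulation might be sketched as follows. This is not the authors' code: the function name `pick_recruits`, its arguments, and in particular the weighting rule are assumptions. The sketch assumes that a type-A friend receives weight 1 + p_diff and a type-B friend weight 1, which reproduces the two endpoints described above (uniform choice at p_diff = 0, type A twice as likely at p_diff = 1); the interpolation between them is a guess.

```python
import random

def pick_recruits(neighbors, is_a, sampled, coupons, p_diff, rng=random):
    """Choose up to `coupons` distinct neighbors who are not yet in the sample.

    Assumed weighting (not taken from the paper): a type-A neighbor has
    weight 1 + p_diff, a type-B neighbor has weight 1.
    """
    pool = [v for v in neighbors if v not in sampled]  # sampling without replacement
    chosen = []
    while pool and len(chosen) < coupons:
        weights = [1.0 + p_diff if is_a[v] else 1.0 for v in pool]
        v = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(v)  # a friend can receive only one coupon from this recruiter
        chosen.append(v)
    return chosen

rng = random.Random(0)
is_a = {1: True, 2: False, 3: True, 4: False}
# Node 2 is already in the sample, so only nodes 1, 3 and 4 are eligible.
print(pick_recruits([1, 2, 3, 4], is_a, {2}, coupons=2, p_diff=1.0, rng=rng))
```

Passing a seeded `random.Random` instance keeps simulation runs reproducible, which helps when comparing estimators over many repetitions.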
Four measurements are then computed from the RDS simulations: the Bias, which is the absolute difference between the
average estimate and the population value, $\left| \frac{1}{m} \sum_{i=1}^{m} est_i - P_A^* \right|$ or
$\left| \frac{1}{m} \sum_{i=1}^{m} est_i - s_{AB}^* \right|$, where $est_i$ is the estimate from the $i$th
simulation and $m$ the number of simulations; the Standard Deviation (SD) of the estimates; the Root Mean Square
Error (RMSE), $\sqrt{\frac{1}{m} \sum_{i=1}^{m} (est_i - P_A^*)^2}$ (with $s_{AB}^*$ in place of $P_A^*$ when link
proportions are estimated); and lastly, the percentage of simulations in which an estimator outperforms the rest:

$$P^{best} = \frac{\text{number of times the estimator gives the closest estimate to } s_{AB}^* \text{ or } P_A^*}{m}.$$

All simulations were repeated 10,000 times, and seeds were excluded from the calculation of estimates in this study.
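
The four measurements can be computed from stored simulation outputs along the following lines. The function names `summarize` and `p_best` are hypothetical, the SD is taken as the population standard deviation over the runs, and ties in closeness are awarded to the estimator listed first; the text does not specify either of these two choices, so both are assumptions.

```python
import math
import statistics

def summarize(estimates, truth):
    """Return (bias, SD, RMSE) of one estimator over m simulation runs."""
    m = len(estimates)
    bias = abs(sum(estimates) / m - truth)
    sd = statistics.pstdev(estimates)  # assumption: population SD over the runs
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / m)
    return bias, sd, rmse

def p_best(estimates_by_name, truth):
    """Share of runs in which each estimator is closest to the population value."""
    names = list(estimates_by_name)
    m = len(estimates_by_name[names[0]])
    wins = dict.fromkeys(names, 0)
    for run in range(m):
        # Assumption: on a tie, the estimator listed first gets the win.
        best = min(names, key=lambda k: abs(estimates_by_name[k][run] - truth))
        wins[best] += 1
    return {k: wins[k] / m for k in names}

runs = {"RDSI": [0.31, 0.50], "RDSI_ego": [0.40, 0.29]}  # made-up toy estimates
print(summarize(runs["RDSI"], truth=0.3))
print(p_best(runs, truth=0.3))
```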

4. Results 

4.1. Random and differential recruitment 
4.1.1. Estimates of network link types 

The difference between RDS I and RDSI^ego lies in the estimation of the recruitment matrix S. As a first step, we
therefore simulate the RDS process with random recruitment ($p_A^{diff} = 0$) and differential recruitment
($p_A^{diff} = 1$) and then estimate the proportion of type $e_{A \to B}$ links in the population, $s_{AB}^*$, by
both the raw sample recruitment proportion, $s_{AB}$, and the proposed ego-network-based estimator,
$\hat{s}^{ego}_{AB}$, for all four variables in the MSM network: age, ct, cs, and pf.

An example of the simulation results for ct is presented in Figure 2. Clearly, when the random recruitment
assumption is fulfilled (Figure 2(a)), both $s_{AB}$ and $\hat{s}^{ego}_{AB}$ are unbiased. Estimates by
$\hat{s}^{ego}_{AB}$ peak closer to $s_{AB}^*$ and have less variance than $s_{AB}$ (SD = 0.02 compared to 0.04, see
Table 2). The difference between $s_{AB}$ and $\hat{s}^{ego}_{AB}$ becomes more evident when RDS is implemented with
differential recruitment. We can see from Figure 2(b) that when peers who live in Stockholm are two times more
likely to be recruited by their friends, the raw sample recruitment proportion substantially underestimates
$s_{AB}^*$ (Bias = 0.09), while $\hat{s}^{ego}_{AB}$ still provides robust estimates (Bias = 0.01) with less variance
(SD = 0.02). If we compare the two estimates within each simulation, $\hat{s}^{ego}_{AB}$ is closer to $s_{AB}^*$
than $s_{AB}$ in 70% of the simulations under random recruitment. Under differential recruitment, almost all
$\hat{s}^{ego}_{AB}$ estimates ($P^{best} = 0.98$) outperform $s_{AB}$.




Figure 2: Distribution of RDS estimates for $s_{AB}^*$ (ct). Dashed line shows the population value $s_{AB}^*$ (ct).
(a) Random recruitment: participants recruit respondents randomly among their friends, $p_A^{diff} = 0$;
(b) Differential recruitment: participants are two times more likely to recruit friends of type A than friends of
type B, $p_A^{diff} = 1$.



Simulation results for estimates of all variables are summarized in Table 2. The conclusions are similar to those
above. $\hat{s}^{ego}_{AB}$ gives smaller bias, SD, and RMSE, and in most instances gives closer estimates,
regardless of homophily and activity ratio. The precision of $s_{AB}$ depends largely on the random recruitment
assumption: the bias and RMSE of $s_{AB}$ are at most 0.01 and 0.04 across all simulation settings when peers are
randomly recruited, while the maximum bias and RMSE both increase to 0.13 under differential recruitment.
$\hat{s}^{ego}_{AB}$, on the other hand, shows great robustness to violation of this assumption. Its maximum bias
and RMSE for all variables are less than 0.02 and 0.03, respectively.

Regarding $P^{best}$, $\hat{s}^{ego}_{AB}$ produces estimates that are closer to the true population value
$s_{AB}^*$ 62% to 74% of the time when sampling is with random recruitment; when sampling with differential
recruitment, $P^{best}$ increases to 77%-100%, revealing the superior performance of $\hat{s}^{ego}_{AB}$ over
$s_{AB}$.

Table 2: Statistics of estimates for s*_AB by s_AB and s^ego_AB (seeds = 6, coupons = 2, SWOR)

                           Bias (standard deviation)     RMSE (P^best)
                           s_AB         s^ego_AB         s_AB         s^ego_AB
Random recruitment
  age                      .00 (.03)    .00 (.03*)       .03 (.37)    .03*(.63*)
  ct                       .01 (.04)    .00*(.02*)       .04 (.30)    .02*(.70*)
  cs                       .00 (.04)    .00*(.02*)       .04 (.26)    .02*(.74*)
  pf                       .00 (.04)    .00*(.02*)       .04 (.26)    .02*(.74*)
Differential recruitment
  age                      .04 (.03)    .01*(.03*)       .05 (.16)    .03*(.84*)
  ct                       .09 (.03)    .01*(.02*)       .10 (.02)    .02*(.98*)
  cs                       [...]        .02*(.02*)       .13 (.00)    .03*(1.0*)
  pf                       .13 (.03)    .02*(.02*)       .13 (.00)    .02*(1.0*)

* corresponding statistic is better than the other estimator.

4.1.2. Estimates of population compositions

The superiority of $\hat{s}^{ego}_{AB}$ over $s_{AB}$ shown in the above section suggests that the RDSI^ego
estimator should also give less bias and error than RDS I. To confirm this, we compare the simulation results of
RDS I and RDSI^ego for estimating population proportions on both the MSM network and the KOSKK networks.

First, we take the estimates of $P_A^*$ for ct as an example. The result is presented as boxplots in Figure 3, where
the median (middle line), the 25th and 75th percentiles (box), and outliers (whiskers) are shown. When
$p_A^{diff} = 0$, individuals who live in Stockholm are on average oversampled in the raw sample (by 0.05); however,
after adjustment, RDS I and RDSI^ego both give unbiased estimates (Figure 3(a)). When $p_A^{diff} = 1$, i.e.,
respondents are twice as likely to recruit friends from Stockholm rather than friends from other counties, the
improvement in estimates by RDSI^ego becomes much more pronounced. While the sample composition/RDS I has a bias of
0.20/0.17 and an RMSE of 0.21/0.18, the bias for RDSI^ego is only 0.02 and the RMSE is 0.06 (Figure 3(b)). Another
notable finding is that the number of times an estimator provides the closest estimate is almost equal between
sample composition and RDS I under random recruitment ($P^{best} = 0.28$ for sample composition and
$P^{best} = 0.29$ for RDS I, see Table 3), implying that even when the RDS sample is collected under ideal
conditions, the traditional adjusted population estimates may perform as poorly as the raw sample proportion.
RDSI^ego, by contrast, produces estimates closest to $P_A^*$ 43% of the time. For sampling with differential
recruitment, RDSI^ego is far superior to the other estimators, with $P^{best} = 0.93$.




Figure 3: RDS estimates for $P_{ct}^*$ by sample composition, RDS I, and RDSI^ego. Dashed line is the population
value $P_{ct}^*$. (a) Random recruitment: participants recruit respondents randomly among their friends,
$p_A^{diff} = 0$; (b) Differential recruitment: participants are two times more likely to recruit friends of type A
rather than friends of type B, $p_A^{diff} = 1$.

The above conclusions are similar for all other variables (see Table 3): when $p_A^{diff} = 0$, both RDS I and
RDSI^ego have little bias, while RDSI^ego generates smaller SD and RMSE and provides the closest estimate about 10%
more often than the other estimators. It is interesting to compare the $P^{best}$ of sample composition with that of
the other estimators: RDSI^ego has a larger $P^{best}$ for all variables except cs, which has low homophily and an
activity ratio close to one. RDS I, by contrast, cannot consistently outperform the sample composition. It has
almost the same probability of providing the closest estimate to $P_A^*$ as the sample composition for ct, and is
even less likely to be better when estimating age and cs. RDSI^ego again becomes dominant when the sampling is done
with differential recruitment. Its bias ranges in [0.00, 0.02] and its RMSE in [0.04, 0.07], while for sample
composition and RDS I the bias and RMSE are much larger, [0.07, 0.20] and [0.09, 0.21], respectively.



Table 3: Statistics of estimates for P*_A by sample mean, RDS I and RDSI^ego (seeds = 6, coupons = 2, SWOR)

                           Bias (standard deviation)                RMSE (P^best)
                           sample       RDS I        RDSI^ego       sample       RDS I        RDSI^ego
Random recruitment
  age                      .01 (.06)    .00*(.07)    .00 (.06*)     .06 (.37)    .07 (.23)    .06*(.40*)
  ct                       .05 (.05)    .00 (.06)    .00*(.05*)     .07 (.28)    .06 (.29)    .05*(.43*)
  cs                       .01 (.03*)   .00 (.04)    .00*(.03)      .03*(.51*)   .04 (.17)    .03 (.32)
  pf                       .05 (.03*)   .00*(.04)    .00 (.03)      .05 (.20)    .04 (.32)    .03*(.48*)
Differential recruitment
  age                      .09 (.05*)   .08 (.06)    .02*(.07)      .10 (.10)    .10 (.12)    .07*(.79*)
  ct                       .20 (.06)    .17 (.07)    .02*(.06*)     .21 (.00)    .18 (.06)    .06*(.93*)
  cs                       .12 (.03*)   .13 (.05)    .02*(.04)      .13 (.00)    .14 (.04)    .04*(.96*)
  pf                       .18 (.03*)   .13 (.05)    .02*(.04)      .18 (.00)    .14 (.05)    .04*(.95*)

* corresponding statistic is better than other estimators.

To better understand the robustness of RDSI^ego to differential recruitment, we simulate RDS processes on the MSM
network with $p_A^{diff}$ varying from 0 to 1. The average estimates for the four variables are shown in Figure 4.
While the bias of RDS I increases progressively with $p_A^{diff}$, RDSI^ego shows clear resistance across different
levels of differential recruitment. Additionally, we can see that the magnitude of the bias of RDS I does not depend
solely on either the homophily or the activity ratio, implying that, without the collection of ego network
information, more sophisticated modifications are needed for RDS I to adapt to differential recruitment.

The complexity of the joint effect of homophily and activity ratio is more evident for RDS estimates on the KOSKK
networks, as shown in Figure 5, where the biases of both RDS I and RDSI^ego are shown for networks with different
levels of homophily ($h_A \in [0, 0.5]$) and activity ratio ($w \in [0.5, 2.5]$).

Generally, bias increases with homophily and with the difference between average degrees. However, these effects are
mixed with the impact of other network structural properties, for example the community structure produced by the
KOSKK model, so that networks with certain combinations of $h_A$ and $w$ are least biased. RDSI^ego shows resistance
to all these structural effects: when $p_A^{diff} = 0$, the bias for RDS I ranges from 0.00 to 0.06, while for
RDSI^ego this range is only [0.00, 0.01]; when $p_A^{diff} = 1$, the maximum bias for RDS I goes up to 0.20, while
the maximum bias for RDSI^ego stays around 0.02.



4.2. Degree reporting error 

With the superior performance observed from the above section, we will from this section focus on RDS I eg0 and 
evaluate factors that may bring extra sources of biases. 

The degree reporting error parameters, $p^{miss}_A$ and $p^{miss}_B$, capture the fact that in social network surveys,
especially surveys targeting hidden populations, individuals in the target population may not be identified as such by
their friends and would thus be left out when a respondent reports the personal network size [33, 23]. This reporting
error not only affects the estimates of average degree, but also biases the estimate of the recruitment matrix in
RDSI^{ego}, $s^{ego}_{XY}$ ($X, Y \in \{A, B\}$).
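
A minimal sketch of how such a degree reporting error could be applied to a single respondent, assuming each alter is missed independently with a group-specific probability. The function name `reported_alters`, the parameter names, and the toy ego network are illustrative assumptions, not the simulation code used in this study.

```python
import random

def reported_alters(alter_groups, p_miss_a, p_miss_b, rng):
    """Return the alters that a respondent recognizes and reports.

    alter_groups: list of "A"/"B" labels, one per true alter.
    Each A alter is missed with probability p_miss_a and each B alter
    with probability p_miss_b, independently (an assumption of this sketch).
    """
    kept = []
    for g in alter_groups:
        p_miss = p_miss_a if g == "A" else p_miss_b
        if rng.random() >= p_miss:  # the alter is identified
            kept.append(g)
    return kept

# Toy ego network: 300 A alters and 700 B alters, only A alters can be missed.
rng = random.Random(0)
true_alters = ["A"] * 300 + ["B"] * 700
kept = reported_alters(true_alters, p_miss_a=0.2, p_miss_b=0.0, rng=rng)
reported_degree = len(kept)
reported_share_a = kept.count("A") / reported_degree
```

In this toy case both the reported degree and the reported share of A alters fall below their true values, which is how the error reaches the average degree and the recruitment matrix estimate.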

We simulate RDS with degree reporting error $p^{miss}_A \in [0, 0.2]$ and $p^{miss}_B \in [0, 0.2]$, that is, up to 20%
of friends with property A or B may go unidentified as members of the target population. To account for the worst-case
scenario, differential recruitment ($p^{diff}_A = 1$) is also included in the simulation. Results are presented in
Figure 6 for the MSM network and Figure 7 for KOSKK networks.

Figure 5: Bias of RDSI and RDSI^{ego} on KOSKK networks with random recruitment (a, c) and differential recruitment (b, d)

Surprisingly, on both the MSM network and KOSKK networks, even with 20% of all alters being miscounted, the
biases of RDSI^{ego} mostly fall within [0.00, 0.05], with a few exceptions. The worst case occurs when 20%
of all alters from one group are missed in the reported degree while none from the other group are missed, giving a
maximum bias of around 0.07. When fewer than 10% of alters are miscounted, most configurations of
[$p^{miss}_A$, $p^{miss}_B$] produce biases below 0.04.

We can also see a symmetric effect of $p^{miss}_A$ and $p^{miss}_B$: the bias stays at the same level as long as the two
parameters change in the same direction. This effect was previously examined in [18], where the degree reporting
error was modeled as unawareness of existing relationships. These findings imply that the magnitude of the bias caused
by degree reporting error is much smaller than the error itself, since an increase of reporting error in one group can
"compensate" for reporting error in the other group; tolerable bias can be expected when the reporting error is limited.

It is worth noting that the biases analyzed here are outcomes of RDS simulations with "extreme" differential
recruitment. We have also run simulations with random recruitment ($p^{diff}_A = 0$), which generate similar patterns
(e.g., the symmetric effect and the location of the maximum bias) with smaller biases; see Appendix Figure 13 and
Figure 14.

4.3. Ego network reporting error 

Another reporting error related to the implementation of RDSI^{ego} is that even when individuals fulfilling the
sample inclusion criteria are correctly identified, their characteristics, especially for sensitive variables such as HIV
status and sexual preference, may be incorrectly reported by their friends. By varying $p^{err}_{A \to B}$ and
$p^{err}_{B \to A}$ from 0, when the composition of ego networks is accurately reported, to 0.2, when 20% of the alters
are misclassified, we run simulations on the MSM network and KOSKK networks to evaluate the sensitivity of RDSI^{ego}
to reporting error in ego network composition. As in the previous section, we use differential recruitment and set
$p^{diff}_A = 1$. Results are shown in Figure 8 and Figure 9.
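
The misclassification step can be sketched as follows, assuming each alter is misreported independently. The function name `reported_share_a` and the arguments `p_err_ab` (an A alter reported as B) and `p_err_ba` (a B alter reported as A) are names chosen for this illustration only.

```python
import random

def reported_share_a(alter_groups, p_err_ab, p_err_ba, rng):
    """Share of alters that the respondent reports as belonging to group A."""
    reported_a = 0
    for g in alter_groups:
        if g == "A":
            reported_a += rng.random() >= p_err_ab  # A correctly reported as A
        else:
            reported_a += rng.random() < p_err_ba   # B wrongly reported as A
    return reported_a / len(alter_groups)

rng = random.Random(0)
ego_network = ["A"] * 3 + ["B"] * 7
exact = reported_share_a(ego_network, 0.0, 0.0, rng)  # no reporting error
noisy = reported_share_a(ego_network, 0.2, 0.2, rng)  # 20% misclassification
```

Unlike the degree reporting error, this error leaves the reported degree unchanged and acts directly on the reported composition.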

Contrary to its robustness to degree reporting error, the RDSI^{ego} estimator is much more sensitive to ego
network reporting error on both the MSM network and KOSKK networks. On the MSM network, the bias readily
exceeds 0.1 as long as $p^{err}_{A \to B} > 0.1$ and $p^{err}_{B \to A} < 0.1$ for age, and $p^{err}_{A \to B} < 0.1$ and
$p^{err}_{B \to A} > 0.1$ for ct. The biases for the other two variables, which have less homophily, are relatively
small as long as the misclassification error for alters of both groups is less than 10%.

Figure 7: Bias of RDSI^{ego} on KOSKK network with differential recruitment and degree reporting error.

Given $p^{err}_{A \to B} \in [0, 0.2]$ and $p^{err}_{B \to A} \in [0, 0.2]$, the ego network reporting error on KOSKK
networks produces much larger bias for networks with low activity ratios ($w < 1$), and an increase of
$p^{err}_{B \to A}$ is clearly more harmful than an increase of $p^{err}_{A \to B}$. This is because when $w < 1$, a large
share of the alters of respondents in the RDS sample are from group B (note also that $P^*_B = 0.7$); even a small
probability of misclassifying B alters as A alters then results in a large absolute number of over-reported A alters,
making RDSI^{ego} generate estimates much higher than the true population value $P^*_A$. For the same reason,
variables with high activity ratios are less sensitive to ego network reporting error.
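
The asymmetry can be made explicit with a short expectation calculation. It is a sketch that assumes each alter is misclassified independently and writes $a$ for the true share of A alters among all reported alters and $\tilde{a}$ for the reported share; both symbols are introduced here for illustration and are not part of the original analysis.

```latex
\mathbb{E}[\tilde{a}] = a\,\bigl(1 - p^{err}_{A \to B}\bigr) + (1 - a)\,p^{err}_{B \to A},
\qquad
\mathbb{E}[\tilde{a}] - a = (1 - a)\,p^{err}_{B \to A} - a\,p^{err}_{A \to B}.
```

With $a = 0.2$, for example, $p^{err}_{B \to A} = 0.1$ alone shifts the reported share by $+0.08$, whereas $p^{err}_{A \to B} = 0.1$ alone shifts it by only $-0.02$. The same expression shows that equal amounts of misclassified alters in both directions cancel out.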

The above reasoning can also be verified with the estimates for age on the MSM network, which has a relatively
balanced activity ratio ($w = 1.05$) but a population proportion of 70%. Reporting error regarding the group
with the higher population proportion and activity ratio therefore results in a substantial number of misclassified
alters in the ego networks and greatly affects the estimates.

Simulations with random recruitment have also been carried out; however, the ego network reporting error seems
to be the dominant factor driving estimation error for RDSI^{ego}, and no significant reduction of bias is observed;
see Appendix Figure 15 and Figure 16.

5. Conclusion and discussion 

Ego network data have been collected for decades and are widely available in sociological surveys [28, 34, 35, 36, 37, 38];
the RDS sampling mechanism further makes it possible to collect "linked-ego network" data. By combining RDS
recruitment trees with ego networks, this study developed a new estimator, RDSI^{ego}, for RDS studies. Provided that
participants can accurately report the composition of their personal networks, this estimator has superior performance
over traditional RDS estimators. Most importantly, RDSI^{ego} shows strong robustness to differential recruitment, a
violation of the RDS assumptions that may cause large bias and estimation error and is not under the control of the
researchers. Evaluation studies on our simulated KOSKK networks also show that RDSI^{ego} performs consistently
well on networks with varying homophily, activity ratio, and community structure.

The limitation of RDSI^{ego} is rooted in the need to collect ego network data. Many RDS studies are designed for use
among hidden populations, whose members may be reluctant to share certain private information with their friends.
Consequently, the proposed method is primarily suited for less sensitive variables, which the respondent can be
expected to know about his or her contacts. Such information may for example include socio-demographic variables
(e.g., gender, age group, profession, marital status, etc.), for which survey methods for designing and collecting ego
network data have been extensively studied [39, 40, 41, 42]. Additionally, certain variables, e.g., drug use, may be
highly sensitive in the general population but not at all so in an IDU population.

By modeling the difficulty in understanding personal network composition as degree reporting error and ego
network reporting error, which quantify the level of mutual knowledge about the studied variables shared among
friends, we have shown that even with 20% of alters being unidentified, RDSI^{ego} was still able to produce estimates
with bias less than 0.05 most of the time. On the other hand, RDSI^{ego} is sensitive to the misclassification of alters.
If 20% of alters from one group are mistakenly reported as belonging to the other group, the bias can exceed 0.1 when
the probability of misclassifying members of one group is substantially larger than that of misclassifying members of
the other group (e.g., $p^{err}_{A \to B} \gg p^{err}_{B \to A}$). Fortunately, the results show that when the studied
variable relates to only a small proportion of alters, that is, if $P^*_A$ is low and $w$ is relatively small, an increase
of the error in misclassifying A as B members has only a small influence on the bias. Consequently, for many sensitive
variables surveyed in RDS studies, if the reporting error for a low-prevalence trait (e.g., HIV status) consists mainly
of "false negatives", e.g., alters with HIV are reported as healthy friends because they are reluctant to reveal this
information to their egos, estimates with small bias can still be expected.

There are other interesting findings from this study. First, RDSI, which has been used in most RDS studies so far,
fails to outperform the sample composition in many simulation settings. Second, we propose in this
paper a new bootstrap method for constructing confidence intervals (CIs) with RDSI^{ego} (see Appendix). Simulations
in this paper and recent studies [21, 17, 43] have shown that the traditional bootstrap method underestimates
variance. The bootstrap method proposed in this paper, however, generates CIs that much better approximate
the expected coverage rates and performs fairly consistently under variations of homophily, activity ratio, and
community structure.

Figure 9: Bias of RDSI^{ego} on KOSKK network with differential recruitment and ego network reporting error.

In summary, we have shown that, by combining the traditional RDS sampling design with the collection of ego
network data, population estimates can be improved drastically. Most importantly, since RDS is a chain-referral
sampling strategy, once sampling has started from the seeds, the distribution of coupons is largely out of the
control of researchers, and non-random recruitment often occurs, which has been shown to generate large bias and
estimation error [19, 18, 16, 44]. The robustness of RDSI^{ego} to differential recruitment offers researchers the
ability to substantially reduce estimation error. Additionally, by comparing $S^{ego}$ with the observed raw sample
recruitment matrix $S$, the severity of differential recruitment may be assessed. For future RDS studies, we encourage
integrating ego network questions into traditional RDS questionnaires, along with the improved bootstrap procedure.
Due to the limitations inherent in collecting sensitive variables from stigmatized groups, the new method may be
better suited to less sensitive variables. The method is also applicable to sampling problems in other fields
[15, 45, 46], such as the sampling of internet content, for which ego network data are more reliable and may be
retrieved more efficiently.

Appendix

Appendix A: Generation process for KOSKK networks 

As one of the dynamical network evolution models, the KOSKK model uses network link weights to generate
networks with key common features of social networks [32]: (i) skewed degree distribution, (ii) assortative mixing,
(iii) high average clustering coefficient, (iv) small average shortest path lengths, and (v) community structure. In a
comprehensive comparative study [31], the KOSKK model was found to be one of the best social network models
for generating realistic social network structures, among nodal attribute models, network evolution models,
and ERGM models.

In the KOSKK model, the network is initialized with N nodes and zero edges, and then evolves through three
mechanisms:

(i) Local attachment. Select a node i randomly, and choose one of i's neighbors, j, with probability
$w_{ij} / \sum_j w_{ij}$, where $w_{ij}$ is the weight on link $e_{ij}$. If j has another neighbor apart from i, choose
one of them (node k) with probability $w_{jk} / (\sum_k w_{jk} - w_{ij})$. If there is no link between i and k, connect
k to i with probability $p_\Delta$ and set $w_{ik} = w_0$. Increase the link weights $w_{ij}$, $w_{jk}$, and $w_{ik}$
(if it was already present) by $\delta$.

(ii) Global attachment. Connect i to a random node l with probability $p_r$ (or with probability 1 if i has no
connections) and set $w_{il} = w_0$.

(iii) Node deletion. Select a random node and, with probability $p_d$, remove all of its connections.
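
The three mechanisms above can be sketched as a single update function. This is a minimal illustration, not the implementation used for the simulations: the dict-of-dicts data structure, the function name `koskk_step`, the choice to apply all three mechanisms in one call, and the toy parameter values (including `p_delta=0.5`) are assumptions made here.

```python
import random

def koskk_step(w, p_delta, p_r, p_d, delta, w0, rng):
    """Apply mechanisms (i)-(iii) once, starting from a randomly chosen node.

    w maps every node to a dict {neighbor: link weight}; it is kept symmetric.
    """
    nodes = list(w)

    def set_weight(a, b, value):
        w[a][b] = value
        w[b][a] = value

    def weighted_pick(weights):
        candidates = list(weights)
        return rng.choices(candidates, weights=[weights[c] for c in candidates])[0]

    i = rng.choice(nodes)

    # (i) Local attachment: weighted walk i -> j -> k, reinforcing the links used.
    if w[i]:
        j = weighted_pick(w[i])
        others = {k: wt for k, wt in w[j].items() if k != i}
        set_weight(i, j, w[i][j] + delta)
        if others:
            k = weighted_pick(others)
            set_weight(j, k, w[j][k] + delta)
            if k in w[i]:
                set_weight(i, k, w[i][k] + delta)  # existing link is reinforced
            elif rng.random() < p_delta:
                set_weight(i, k, w0)               # new link closes the triangle

    # (ii) Global attachment: certain if i is isolated, else with probability p_r.
    if not w[i] or rng.random() < p_r:
        free = [v for v in nodes if v != i and v not in w[i]]
        if free:
            set_weight(i, rng.choice(free), w0)

    # (iii) Node deletion: with probability p_d a random node loses all its links.
    if rng.random() < p_d:
        victim = rng.choice(nodes)
        for neighbor in list(w[victim]):
            del w[neighbor][victim]
        w[victim].clear()

# Toy run on 50 nodes (far smaller than the networks used in the paper).
rng = random.Random(1)
w = {v: {} for v in range(50)}
for _ in range(2000):
    koskk_step(w, p_delta=0.5, p_r=0.01, p_d=0.001, delta=0.6, w0=1.0, rng=rng)
n_links = sum(len(nbrs) for nbrs in w.values()) // 2
```

Scanning all nodes for a global attachment partner is linear in N and is kept only for readability; a large simulation would sample a partner directly.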

With larger $\delta$, clearer community structures are generated, as new links are created preferentially through
strong links. When $p_d$ is fixed, the desired average degree is obtained by adjusting $p_\Delta$ for each $\delta$. In
our simulation, we set $N = 10000$, $w_0 = 1$, $p_r = 0.0005$, $p_d = 0.001$, $\delta = 0.6$, and the network average
degree $D^* = 10$. The process was run for $10^8$ time steps to achieve stationary network characteristics. At the end
of the process, a few nodes are isolated due to the node deletion step; we simply link these nodes randomly to the
giant connected component to make sure all nodes in the network are connected. As $\delta$ is relatively large, the
obtained network shows a clear community structure; see Figure 10.

Based on the above network, we then configure homophily and activity ratio. Let $w$ be the activity
ratio of the current network and $w^*$ be the activity ratio we want to obtain. At the beginning, 30% of the nodes are
randomly selected and assigned property A; the remaining nodes are assigned property B. If $w > w^*$, we
randomly pick a node i with property A and a node j with property B; if $d_i > d_j$, we exchange the properties of
the two nodes, i.e., i becomes a B node and j becomes an A node. If $w < w^*$, we exchange the properties of i and j
only when $d_i < d_j$. The above process is repeated until $w = w^*$.
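
A minimal sketch of this label-swapping procedure, assuming the activity ratio is the mean degree of A nodes divided by the mean degree of B nodes. The function name `tune_activity_ratio`, the tolerance `tol` (exact equality is rarely reachable with integer degrees), the iteration cap, and the toy degree sequence are assumptions of this illustration.

```python
import random

def tune_activity_ratio(degree, group, w_target, rng, tol=0.02, max_iter=100_000):
    """Swap A/B labels between node pairs until mean_deg(A)/mean_deg(B) ~ w_target.

    degree: dict node -> degree; group: dict node -> "A"/"B" (modified in place).
    """
    nodes_a = [v for v in group if group[v] == "A"]
    nodes_b = [v for v in group if group[v] == "B"]
    sum_a = sum(degree[v] for v in nodes_a)
    sum_b = sum(degree[v] for v in nodes_b)

    def ratio():
        return (sum_a / len(nodes_a)) / (sum_b / len(nodes_b))

    for _ in range(max_iter):
        w = ratio()
        if abs(w - w_target) <= tol:
            break
        ia, ib = rng.randrange(len(nodes_a)), rng.randrange(len(nodes_b))
        i, j = nodes_a[ia], nodes_b[ib]
        gap = degree[i] - degree[j]
        # w too high: move a high degree out of A; w too low: the reverse.
        if (w > w_target and gap > 0) or (w < w_target and gap < 0):
            group[i], group[j] = "B", "A"
            nodes_a[ia], nodes_b[ib] = j, i
            sum_a -= gap
            sum_b += gap
    return ratio()

# Toy example: 1000 nodes with random degrees, 30% initially labeled A.
rng = random.Random(2)
degree = {v: rng.randint(1, 20) for v in range(1000)}
group = {v: "B" for v in degree}
for v in rng.sample(range(1000), 300):
    group[v] = "A"
w_final = tune_activity_ratio(degree, group, w_target=1.5, rng=rng)
```

Because labels are only exchanged, the group sizes (and hence the population proportion of A) stay fixed while the activity ratio moves toward its target.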

For each network configured with $w^*$, we use a rewiring process to adjust the homophily. Recall that the
homophily depends on the number of cross-group links as $h_A = 1 - s^*_{AB} / P^*_B$; a smaller $s^*_{AB}$ indicates
higher homophily. Let $h_A$ be the homophily of the current network and $h^*_A$ be the desired value. If $h_A > h^*_A$,
we randomly pick two within-group links $i \leftrightarrow j$ and $k \leftrightarrow l$, with i, j belonging to group A
and k, l belonging to group B, and rewire them to $i \leftrightarrow k$ and $j \leftrightarrow l$ to increase the number
of cross-group links. Similarly, if $h_A < h^*_A$, we randomly pick two cross-group links and rewire them
to form two within-group links. The above process is repeated until $h_A = h^*_A$.
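
The rewiring procedure can be sketched as below. The sketch assumes that the cross-group term in the homophily formula is the share of A's link ends that lead to B nodes; the function names, the tolerance `tol`, the rejection of rewirings that would duplicate an existing link, and the random toy graph are all assumptions of this illustration.

```python
import random
from collections import Counter

def _key(u, v):
    return (u, v) if u < v else (v, u)

def homophily_a(edges, group):
    """h_A = 1 - s_AB / P_B, with s_AB the share of A's link ends that lead to B."""
    p_b = sum(g == "B" for g in group.values()) / len(group)
    ends_a = sum((group[u] == "A") + (group[v] == "A") for u, v in edges)
    cross = sum(group[u] != group[v] for u, v in edges)
    return 1 - (cross / ends_a) / p_b

def rewire_step(edges, group, more_cross, rng):
    """Try one degree-preserving rewiring; return True if it was applied."""
    aa, ab, bb = [], [], []
    for u, v in edges:
        if group[u] != group[v]:
            ab.append((u, v) if group[u] == "A" else (v, u))  # A end first
        else:
            (aa if group[u] == "A" else bb).append((u, v))
    if more_cross:   # i-j (A-A), k-l (B-B)  ->  i-k, j-l
        if not aa or not bb:
            return False
        (i, j), (k, l) = rng.choice(aa), rng.choice(bb)
        old, new = [_key(i, j), _key(k, l)], [_key(i, k), _key(j, l)]
    else:            # i-k, j-l (both cross)  ->  i-j (A-A), k-l (B-B)
        if len(ab) < 2:
            return False
        (i, k), (j, l) = rng.sample(ab, 2)
        if i == j or k == l:
            return False
        old, new = [_key(i, k), _key(j, l)], [_key(i, j), _key(k, l)]
    if any(e in edges for e in new):  # do not create duplicate links
        return False
    edges.difference_update(old)
    edges.update(new)
    return True

def tune_homophily(edges, group, h_target, rng, tol=0.01, max_iter=10_000):
    for _ in range(max_iter):
        h = homophily_a(edges, group)
        if abs(h - h_target) <= tol:
            break
        rewire_step(edges, group, more_cross=h > h_target, rng=rng)
    return homophily_a(edges, group)

# Toy example: random graph with 200 nodes (30% A) and 800 links.
rng = random.Random(3)
group = {v: ("A" if v < 60 else "B") for v in range(200)}
edges = set()
while len(edges) < 800:
    u, v = rng.sample(range(200), 2)
    edges.add(_key(u, v))
degrees_before = Counter(x for e in edges for x in e)
h_final = tune_homophily(edges, group, h_target=0.3, rng=rng)
degrees_after = Counter(x for e in edges for x in e)
```

Each rewiring keeps every node's degree unchanged, so the activity ratio configured in the previous step is not disturbed while the homophily is adjusted.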

Figure 10: Visualization of the KOSKK network generated with $\delta = 0.6$, $D^* = 10$.

Appendix B: Confidence interval estimation

The precision of a sample estimate is usually conveyed by a confidence interval (CI), which gives a range
within which the true population value is expected to be found with some level of certainty. Due to the complex sample
design of RDS, CIs based on simple random sampling are generally narrower than they should be [17, 11, 20].
Consequently, bootstrap methods are used to construct CIs around RDS estimates.

The currently most widely used bootstrap procedure for RDS (BS-origin) was proposed by Salganik [20, 47]. In this
procedure, respondents are divided into two groups depending on the property of their recruiters, that is, those who
were recruited by A nodes ($A_{rec}$) and those who were recruited by B nodes ($B_{rec}$). The bootstrap then starts
from a randomly chosen respondent. If the respondent has property A, the next respondent is randomly picked from
$A_{rec}$, otherwise from $B_{rec}$. This procedure is repeated with replacement until the original RDS sample size is
reached, and the RDS estimate is then calculated from the replicated sample. When R replicated samples have been
bootstrapped, the middle 90%/95% of the ordered R estimates is used as the estimated CI.
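
The resampling and the percentile interval can be sketched as follows. The data layout (pairs of own group and recruiter group), the function names, and the toy sample are assumptions of this illustration, and the estimator `share_a` is a plain sample proportion used as a stand-in, not the RDSI estimator.

```python
import random

def bs_origin_replicate(sample, rng):
    """sample: list of (own_group, recruiter_group) pairs; seeds have recruiter None.

    Assumes that both recruiter groups are non-empty.
    """
    rec = {"A": [r for r in sample if r[1] == "A"],   # recruited by A nodes
           "B": [r for r in sample if r[1] == "B"]}   # recruited by B nodes
    current = rng.choice(sample)
    replicate = [current]
    while len(replicate) < len(sample):
        current = rng.choice(rec[current[0]])  # drawn with replacement
        replicate.append(current)
    return replicate

def percentile_ci(estimates, level=0.95):
    """Middle `level` share of the ordered bootstrap estimates."""
    ordered = sorted(estimates)
    cut = round(len(ordered) * (1 - level) / 2)
    return ordered[cut], ordered[len(ordered) - 1 - cut]

def share_a(replicate):
    # Placeholder estimator (sample proportion of A), NOT the RDSI estimator.
    return sum(r[0] == "A" for r in replicate) / len(replicate)

# Toy sample of 200 respondents with random groups and recruiter groups.
rng = random.Random(4)
sample = [(rng.choice("AB"), rng.choice("AB")) for _ in range(200)]
estimates = [share_a(bs_origin_replicate(sample, rng)) for _ in range(500)]
ci_low, ci_high = percentile_ci(estimates, level=0.95)
```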

We extend BS-origin in two different ways:

(a) BS-ego1: we implement the same resampling procedure as in BS-origin; however, for each replicated
sample, RDSI^{ego} is used to calculate the RDS estimate, rather than RDSI.

(b) BS-ego2: we divide the sample into two groups depending on the property of the respondents, that is, those
with property A ($A_{set}$) and those with property B ($B_{set}$). The bootstrap procedure then starts with a randomly
picked respondent. If the respondent has property A, the probabilities of selecting the next respondent from $A_{set}$
or $B_{set}$ are $1 - s^{ego}_{AB}$ and $s^{ego}_{AB}$, respectively. If the respondent has property B, the probabilities
of selecting the next respondent from $A_{set}$ or $B_{set}$ are $s^{ego}_{BA}$ and $1 - s^{ego}_{BA}$, respectively. The
above process is repeated until the same size as the original sample is reached. RDSI^{ego} is then used to calculate
the RDS estimate for each replicated sample.
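
A minimal sketch of the BS-ego2 resampling step, assuming the two selection probabilities are the ego-network-based cross-group recruitment proportions, called `s_ab` (from an A respondent to a B respondent) and `s_ba` here. These names, the function `bs_ego2_replicate`, and the toy values are illustrative assumptions; the RDSI^{ego} estimate that would be computed on each replicate is not shown.

```python
import random

def bs_ego2_replicate(sample_groups, s_ab, s_ba, rng):
    """sample_groups: list of "A"/"B" labels, one per respondent.

    Returns the indices of the resampled respondents.
    """
    sets = {"A": [i for i, g in enumerate(sample_groups) if g == "A"],
            "B": [i for i, g in enumerate(sample_groups) if g == "B"]}
    current = rng.randrange(len(sample_groups))
    replicate = [current]
    while len(replicate) < len(sample_groups):
        if sample_groups[current] == "A":
            next_group = "B" if rng.random() < s_ab else "A"
        else:
            next_group = "A" if rng.random() < s_ba else "B"
        current = rng.choice(sets[next_group])  # drawn with replacement
        replicate.append(current)
    return replicate

# Toy sample with 40 A and 60 B respondents and made-up probabilities.
rng = random.Random(5)
sample_groups = ["A"] * 40 + ["B"] * 60
replicate = bs_ego2_replicate(sample_groups, s_ab=0.3, s_ba=0.2, rng=rng)
```

In contrast to BS-origin, the group of the next draw here depends on the ego-network-based proportions rather than on who happened to recruit whom in the observed sample.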

We expect that the modification in the bootstrap procedure of BS-ego2, which introduces the ego network data
based estimates $s^{ego}_{AB}$ and $s^{ego}_{BA}$, can improve the performance of the estimated CIs when RDS is carried
out with differential recruitment.

Following [20], we use simulations on both the MSM network and KOSKK networks to compare the performance
of BS-origin, BS-ego1, and BS-ego2. For each variable, 1000 RDS samples are collected, and for each of these
1000 samples we construct the 90% and 95% CIs based on 1000 replicate samples drawn by the above bootstrap
procedures. The proportion of times the generated confidence interval contains the true population value $P^*_A$
(the coverage rate, $\Phi$) when sampling with random recruitment and with differential recruitment is compared
across the bootstrap methods; results are presented in Figure 11 and Figure 12.
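
The coverage rate itself is a simple proportion, sketched below. The function name `coverage_rate` and the four intervals are made up for illustration and are not results from this study.

```python
def coverage_rate(intervals, true_value):
    """Share of confidence intervals (low, high) that contain true_value."""
    hits = sum(low <= true_value <= high for low, high in intervals)
    return hits / len(intervals)

# Four hypothetical CIs for a true population proportion of 0.30.
intervals = [(0.10, 0.30), (0.25, 0.40), (0.31, 0.50), (0.20, 0.35)]
rate = coverage_rate(intervals, true_value=0.30)
```

A well-calibrated 95% procedure should give a rate close to 0.95 over many simulated samples.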

On the MSM network, when sampling with random recruitment, we can see from Figure 11(a), (b) that all three
methods produce similar coverage rates for the tested variables. The coverage rate for age is significantly smaller
than the desired value for both the 90% and the 95% CIs, indicating that even under ideal conditions, the
bootstrap-based CIs in RDS may be much narrower than expected. When RDS is carried out with differential recruitment
(Figure 11(c), (d)), the coverage rate of BS-origin becomes extremely small, rendering the CIs practically useless.
This is because the RDSI estimates are heavily biased away from the true population value when differential
recruitment exists. The coverage rates of BS-ego1 and BS-ego2, on the other hand, are well above 50% for all four
variables and therefore clearly outperform BS-origin. In general, there is 5%~10% more coverage at both CI levels for
BS-ego2 compared to BS-ego1, implying that the modified bootstrap procedure is more resistant to violation of the
random recruitment assumption in RDS.

BS-origin performs poorly on KOSKK networks under both random recruitment and differential recruitment, with a
majority of the 95% coverage rates under 50%. The RDSI^{ego}-based bootstrap methods all produce coverage rates
20%~60% higher than BS-origin. When $p^{diff}_A = 0$, there is no significant difference between
BS-ego1 and BS-ego2; however, when $p^{diff}_A = 1$, BS-ego2 is able to produce 8%~14% higher coverage rates than
BS-ego1 in extreme cases ($w = 0.5$).

It is worth noting that, even though BS-ego2 shows superior performance over BS-origin and is robust to variations
in the network structural properties evaluated in this study (e.g., homophily and activity ratio), the bootstrapped
CIs rarely reach the required coverage rates. On KOSKK networks, it is common for the 95% coverage rates to be
5%~20% lower than expected. Although the community structure in these networks may impede the performance of RDS
estimates as well as of the bootstrap methods, future work is needed to develop CI estimation methods with improved
precision.

Figure 11: Coverage rate of 90% and 95% confidence intervals, by bootstrap procedure BS-origin, BS-ego1, and BS-ego2, on the MSM network.

Figure 12: Coverage rate of 95% confidence interval, by bootstrap procedure BS-origin, (BS-ego1), and [BS-ego2] on KOSKK networks, (a)

Figure 14: Bias of RDSI^{ego} on KOSKK network with random recruitment and degree reporting error.

Figure 16: Bias of RDSI^{ego} on KOSKK network with random recruitment and ego network reporting error.